271 research outputs found

    Pairing Software-Managed Caching with Decay Techniques to Balance Reliability and Static Power in Next-Generation Caches

    Since array structures represent well over half the area and transistors on-chip, maintaining their ability to scale is crucial for overall technology scaling. Shrinking transistor sizes are resulting in increased probabilities of single events causing single- and multi-bit upsets, which require adoption of more complex and power-hungry error detection and correction codes (ECC) in hardware. At the same time, SRAM leakage energy is increasing partly due to technology trends and partly due to the increasing number of transistors present. This paper proposes and evaluates methods of reducing the static power requirements of caches, while also maintaining high reliability. In particular, we propose methods of applying reduced ECC techniques to data that has been identified (by programmer or compiler) as error-tolerant. This segregation, in turn, makes both the default data and the error-tolerant data more amenable to decay-based techniques for leakage control. We examine the potential of this split memory hierarchy along several dimensions. In particular, we consider the power and reliability issues inherent in the approach. Overall, we show that our approach allows the ECC requirements of future applications and caches to be met while also reducing leakage energy.

    QDB: From Quantum Algorithms Towards Correct Quantum Programs

    With the advent of small-scale prototype quantum computers, researchers can now code and run quantum algorithms that were previously proposed but not fully implemented. In support of this growing interest in quantum computing experimentation, programmers need new tools and techniques to write and debug QC code. In this work, we implement a range of QC algorithms and programs in order to discover what types of bugs occur and what defenses against those bugs are possible in QC programs. We conduct our study by running small-sized QC programs in QC simulators in order to replicate published results in QC implementations. Where possible, we cross-validate results from programs written in different QC languages for the same problems and inputs. Drawing on this experience, we provide a taxonomy for QC bugs, and we propose QC language features that would aid in writing correct code.

    TransForm: Formally Specifying Transistency Models and Synthesizing Enhanced Litmus Tests

    Memory consistency models (MCMs) specify the legal ordering and visibility of shared memory accesses in a parallel program. Traditionally, instruction set architecture (ISA) MCMs assume that relevant program-visible memory ordering behaviors only result from shared memory interactions that take place between user-level program instructions. This assumption fails to account for virtual memory (VM) implementations that may result in additional shared memory interactions between user-level program instructions and both 1) system-level operations (e.g., address remappings and translation lookaside buffer invalidations initiated by system calls) and 2) hardware-level operations (e.g., hardware page table walks and dirty bit updates) during a user-level program's execution. These additional shared memory interactions can impact the observable memory ordering behaviors of user-level programs. Thus, memory transistency models (MTMs) have been coined as a superset of MCMs to additionally articulate VM-aware consistency rules. However, no prior work has enabled formal MTM specifications, nor methods to support their automated analysis. To fill the above gap, this paper presents the TransForm framework. First, TransForm features an axiomatic vocabulary for formally specifying MTMs. Second, TransForm includes a synthesis engine to support the automated generation of litmus tests enhanced with MTM features (i.e., enhanced litmus tests, or ELTs) when supplied with a TransForm MTM specification. As a case study, we formally define an estimated MTM for Intel x86 processors, called x86t_elt, that is based on observations made by an ELT-based evaluation of an Intel x86 MTM implementation from prior work and available public documentation. 
    Given x86t_elt and a synthesis bound as input, TransForm's synthesis engine successfully produces a set of ELTs including relevant ELTs from prior work.
    Comment: *This is an updated version of the TransForm paper that features updated results reflecting performance optimizations and software bug fixes. 14 pages, 11 figures, Proceedings of the 47th Annual International Symposium on Computer Architecture (ISCA

    SignalGuru: Leveraging mobile phones for collaborative traffic signal schedule advisory

    While traffic signals are necessary to safely control competing flows of traffic, they inevitably enforce a stop-and-go movement pattern that increases fuel consumption, reduces traffic flow and causes traffic jams. These side effects can be alleviated by providing drivers and their onboard computational devices (e.g., vehicle computer, smartphone) with information about the schedule of the traffic signals ahead. Based on when the signal ahead will turn green, drivers can then adjust speed so as to avoid coming to a complete halt. Such information is called Green Light Optimal Speed Advisory (GLOSA). Alternatively, the onboard computational device may suggest an efficient detour that will save the driver from stops and long waits at red lights ahead. This paper introduces and evaluates SignalGuru, a novel software service that relies solely on a collection of mobile phones to detect and predict the traffic signal schedule, enabling GLOSA and other novel applications. SignalGuru leverages windshield-mounted phones to opportunistically detect current traffic signals with their cameras, collaboratively communicate and learn traffic signal schedule patterns, and predict their future schedule. Results from two deployments of SignalGuru, using iPhones in cars in Cambridge (MA, USA) and Singapore, show that traffic signal schedules can be predicted accurately. On average, SignalGuru's predictions come within 0.66 s for pre-timed traffic signals and within 2.45 s for traffic-adaptive traffic signals. Feeding SignalGuru's predicted traffic schedule to our GLOSA application, our vehicle fuel consumption measurements show savings of 20.3%, on average.
    National Science Foundation (U.S.) (Grant number CSR-EHS-0615175); Singapore-MIT Alliance for Research and Technology Center. Future Urban Mobilit

    Noise-Adaptive Compiler Mappings for Noisy Intermediate-Scale Quantum Computers

    A massive gap exists between current quantum computing (QC) prototypes and the size and scale required for many proposed QC algorithms. Current QC implementations are prone to noise and variability which affect their reliability, and yet with fewer than 80 quantum bits (qubits) total, they are too resource-constrained to implement error correction. The term Noisy Intermediate-Scale Quantum (NISQ) refers to these current and near-term systems of 1000 qubits or fewer. Given NISQ's severe resource constraints, low reliability, and high variability in physical characteristics such as coherence time or error rates, it is of pressing importance to map computations onto them in ways that use resources efficiently and maximize the likelihood of successful runs. This paper proposes and evaluates backend compiler approaches to map and optimize high-level QC programs to execute with high reliability on NISQ systems with diverse hardware characteristics. Our techniques all start from an LLVM intermediate representation of the quantum program (such as would be generated from high-level QC languages like Scaffold) and generate QC executables runnable on the IBM Q public QC machine. We then use this framework to implement and evaluate several optimal and heuristic mapping methods. These methods vary in how they account for the availability of dynamic machine calibration data, the relative importance of various noise parameters, the different possible routing strategies, and the relative importance of compile-time scalability versus runtime success. Using real-system measurements, we show that fine-grained spatial and temporal variations in hardware parameters can be exploited to obtain an average 2.9x (and up to 18x) improvement in program success rate over the industry standard IBM Qiskit compiler.
    Comment: To appear in ASPLOS'1

    Using LLMs to Facilitate Formal Verification of RTL

    Formal property verification (FPV) has existed for decades and has been shown to be effective at finding intricate RTL bugs. However, formal properties, such as those written as SystemVerilog Assertions (SVA), are time-consuming and error-prone to write, even for experienced users. Prior work has attempted to lighten this burden by raising the abstraction level so that SVA is generated from high-level specifications. However, this does not eliminate the manual effort of reasoning and writing about the detailed hardware behavior. Motivated by the increased need for FPV in the era of heterogeneous hardware and the advances in large language models (LLMs), we set out to explore whether LLMs can capture RTL behavior and generate correct SVA properties. First, we design an FPV-based evaluation framework that measures the correctness and completeness of SVA. Then, we evaluate GPT4 iteratively to craft the set of syntax and semantic rules needed to prompt it toward creating better SVA. We extend the open-source AutoSVA framework by integrating our improved GPT4-based flow to generate safety properties, in addition to facilitating their existing flow for liveness properties. Lastly, our use cases evaluate (1) the FPV coverage of GPT4-generated SVA on complex open-source RTL and (2) using generated SVA to prompt GPT4 to create RTL from scratch. Through these experiments, we find that GPT4 can generate correct SVA even for flawed RTL, without mirroring design errors. Particularly, it generated SVA that exposed a bug in the RISC-V CVA6 core that eluded the prior work's evaluation.
    Comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibl

    Magic-State Functional Units: Mapping and Scheduling Multi-Level Distillation Circuits for Fault-Tolerant Quantum Architectures

    Quantum computers have recently made great strides and are on a long-term path towards useful fault-tolerant computation. A dominant overhead in fault-tolerant quantum computation is the production of high-fidelity encoded qubits, called magic states, which enable reliable error-corrected computation. We present the first detailed designs of hardware functional units that implement space-time optimized magic-state factories for surface code error-corrected machines. Interactions among distant qubits require surface code braids (physical pathways on chip) which must be routed. Magic-state factories are circuits composed of a complex set of braids that is more difficult to route than quantum circuits considered in previous work [1]. This paper explores the impact of scheduling techniques, such as gate reordering and qubit renaming, and we propose two novel mapping techniques: braid repulsion and dipole moment braid rotation. We combine these techniques with graph partitioning and community detection algorithms, and further introduce a stitching algorithm for mapping subgraphs onto a physical machine. Our results show a factor of 5.64 reduction in space-time volume compared to the best-known previous designs for magic-state factories.
    Comment: 13 pages, 10 figure

    Resource Optimized Quantum Architectures for Surface Code Implementations of Magic-State Distillation

    Quantum computers capable of solving classically intractable problems are under construction, and intermediate-scale devices are approaching completion. Current efforts to design large-scale devices require allocating immense resources to error correction, with the majority dedicated to the production of high-fidelity ancillary states known as magic-states. Leading techniques focus on dedicating a large, contiguous region of the processor as a single "magic-state distillation factory" responsible for meeting the magic-state demands of applications. In this work we design and analyze a set of optimized factory architectural layouts that divide a single factory into spatially distributed factories located throughout the processor. We find that distributed factory architectures minimize the space-time volume overhead imposed by distillation. Additionally, we find that the number of distributed components in each optimal configuration is sensitive to application characteristics and underlying physical device error rates. More specifically, we find that the rate at which T-gates are demanded by an application has a significant impact on the optimal distillation architecture. We develop an optimization procedure that discovers the optimal number of factory distillation rounds and number of output magic states per factory, as well as an overall system architecture that interacts with the factories. This yields between a 10x and 20x resource reduction compared to commonly accepted single factory designs. Performance is analyzed across representative application classes such as quantum simulation and quantum chemistry.
    Comment: 16 pages, 14 figure